Background

Sample covariance

Sample covariance between variables \(x=x_t\) and \(y=y_t\):

\[ c_{xy} = \frac{1}{N} \sum_i^N (x_i-\overline{x}) (y_i-\overline{y}) \]

Sample cross-covariance function for positive values of lag between variables \(x_t\) and \(y_{t+k}\) (Chatfield, The Analysis of Time Series, 2004):

\[ c_{xy}(k) = \frac{1}{N} \sum_{t=1}^{N-k} (x_t-\overline{x})(y_{t+k}-\overline{y}) \]

Sample correlation

Pearson’s correlation coefficient (sample correlation) is defined as the covariance of two variables divided by the product of their standard deviations (which are the square roots of their respective variances):

\[ r_{xy} = \frac{c_{xy}}{\sqrt{c_{xx}c_{yy}}} %= \frac{\sum (x_i-\overline{x})(y_i-\overline{y})}{\sqrt{ \sum (x_i-\overline{x})^2 \sum (y_i-\overline{y})^2 }} \]

The sample cross-correlation function: \[ r_{xy}(k) = \frac{c_{xy}(k)}{\sqrt{c_{xx}(0)c_{yy}(0)}} \]

\(c_{xx}\) and \(c_{yy}\) are the sample variances of \(x\) and \(y\), respectively.

\(c_{xx}(0)\) and \(c_{yy}(0)\) that are the sample variances of \(x_t\) and \(y_t\) respectively.

Correlation coefficients illustrated

“For descriptive purposes, the relationship will be described as strong if \(|r| \geq .8\), moderate if \(.5 < |r| <.8\), and weak if \(|r| \leq .5\).” – Devore and Berk, Modern Mathematical Statistics with Applications, 2012

Devore and Berk, _Modern Mathematical Statistics with Applications_, 2012, Figure 12.24

Anscombe’s quartet classically illustrates the pitfalls on relying on a single coefficient – always visualize your data. Consider the following four datasets:

All have similar statistical properties.

set mean of x mean of y variance of x variance of y correlation intercept slope
1 9 7.5 11 4.127 0.82 3 0.5
2 9 7.5 11 4.128 0.82 3 0.5
3 9 7.5 11 4.123 0.82 3 0.5
4 9 7.5 11 4.123 0.82 3 0.5

Cross (lagged) correlations illustrated

Illustration of lag for 20.07.2013 in Lausanne:

lag_example

  • Radiation initiates the photochemistry for ozone formation and is strongly linked to the concentration of atmospheric oxidants.
  • Increase in radiation also leads to increase in temperatures, which accelerate rates of reaction.
  • The ozone production is not instantaneous, hence the lag.

Could a similar relationship be confirmed by examining correlations between the daily maximum values of radiation and ozone?

Confounding interpretations

A correlation or lagged correlation in x and y may also be observed. For examples, a correlation between \(x\) and \(y\) may not be due to the causal relationship between \(x\) and \(y\), but dependent on a third variable, \(z\). This is written:

\[ \begin{aligned} y_t &= f_y(z_t)\\ x_t &= f_x(z_t) \end{aligned} \]

\[ \begin{aligned} y_{t+k} &= f_y(z_t)\\ x_{t} &= f_x(z_t) \end{aligned} \]

Correlation does not imply causation”:

  • The variation between two variables may depend on a third, as shown above.
  • Even if the variation in one variable is dependent on another, the correlation does not indicate the direction of causality.

R demonstration

library(dplyr)
library(reshape2)
library(chron)
library(ggplot2)
source("GRB001.R")
Sys.setlocale("LC_TIME","C")
options(stringsAsFactors=FALSE)
options(chron.year.abb=FALSE)
theme_set(theme_bw()) # just my preference for plots

Let us load data saved from Lesson 4.

df <- readRDS("data/2013/lau-zue.rds")
variables <- c("O3", "NO2", "CO", "PM10", "SO2", "NMVOC", "EC", "TEMP", "PREC", "RAD")

Pairwise correlations

Before computing correlations, we will first visualize the relationships using scatter plots (or x-y plots).

Let us calculate the daily maximum values for each variable:

lf <- melt(df, measure.vars=variables)
dm <- lf %>% group_by(site, year, month, day, season, variable) %>%
  summarize(value=max(value, na.rm=TRUE))
daily.max <- dcast(dm, site + year + month + day + season ~ variable)
rm(dm) # no longer needed

Recall the relationship between temperature and O3 shown in a previous lecture. Note the seasonal dependence of this relationship.

ggp <- ggplot(daily.max)+
  facet_grid(site~season)+
  geom_point(aes(TEMP,O3))
print(ggp)

The corresponding correlation values can be obtained with the built-in function, cor (see `?cor).

(cor.values <- daily.max %>% group_by(site, season) %>%
   summarize(correlation=cor(TEMP, O3)))
## Source: local data frame [8 x 3]
## Groups: site [?]
## 
##    site season correlation
##   (chr) (fctr)       (dbl)
## 1   LAU    DJF  0.08360881
## 2   LAU    MAM  0.52239691
## 3   LAU    JJA  0.80040012
## 4   LAU    SON  0.52610401
## 5   ZUE    DJF  0.20129372
## 6   ZUE    MAM  0.62341179
## 7   ZUE    JJA  0.84064503
## 8   ZUE    SON  0.65368411

We can format this table and save to file:

(cor.values.wf <- dcast(cor.values, season~site, value.var="correlation"))
##   season        LAU       ZUE
## 1    DJF 0.08360881 0.2012937
## 2    MAM 0.52239691 0.6234118
## 3    JJA 0.80040012 0.8406450
## 4    SON 0.52610401 0.6536841
write.csv2(cor.values.wf, "correlations.csv", row.names=FALSE)

write.csv2 writes to a CSV (comma separated value) file using the European convention of defining delimiters as semicolons (;) rather than commas (,).

Or, we can visualize the values:

ggp <- ggplot(cor.values)+
  geom_bar(aes(season, correlation), stat="identity")+
  scale_y_continuous(limits=c(0,1))+
  facet_grid(.~site)
print(ggp)

We can examine the correlation of other pollutants or meterological variables with O3.

lf <- melt(daily.max, measure.vars=setdiff(variables, "O3")) # everything but ozone
ggp <- ggplot(lf)+
  facet_grid(site~variable, scale="free_x")+
  geom_point(aes(value, O3, group=season, color=season), shape=4)
print(ggp)

We will next look at a scatter plot between hourly measurements of CO and NO2. Why does this relationship exist?

ggp <- ggplot(df)+
  facet_grid(site~season)+
  geom_point(aes(CO, NO2))
print(ggp)

For the following scatterplot matrix, we will use a package called lattice which is a plotting system that exists in parallel to R’s base graphics and ggplot2 package (confusing, I know). We additionally define a function to include correlation coefficients in our panels.

library(lattice)

CorrelationValue <- function(x, y, ...) {
  correlation <- cor(x, y,use="pairwise")
  if(!is.na(correlation)) {
    cpl <- current.panel.limits()
    panel.text(mean(cpl$xlim),mean(cpl$ylim),
               bquote(italic(r)==.(sprintf("%.2f",correlation))),
               adj=c(0.5,0.5),col="blue")
  }
}

Plot daily maximum values as pairwise points (only Lausanne) using this syntax:

ix <- grepl("LAU", daily.max[["site"]], fixed=TRUE)
spp <- splom(~daily.max[ix,c("O3","NO2","CO","PM10","TEMP","PREC","RAD")] | daily.max[ix,"season"],
             upper.panel = CorrelationValue,
             pch=4)
print(spp)

Plot hourly data as pairwise points. Given the large number of points, we can “smooth” the visual representation.

ix <- grepl("LAU", df[["site"]], fixed=TRUE)
spp <- splom(~df[ix,c("O3","NO2","CO","PM10","TEMP","PREC","RAD")] | df[ix,"season"],
             upper.panel = CorrelationValue,
             panel = panel.smoothScatter,
             pch=4)
print(spp)

Notice some values of correlation went up.

Lagged correlations

We can define a function to generate lagged pairs for \(k\) intervals and apply it using the do() function. If we provide hourly measurements, \(k=1\) corresponds to a lag of 1 hour, and so on.

Lag <- function(pair, k) {
  out <- data.frame(lag=k, head(pair[,1],-k), tail(pair[,2],-k))
  names(out)[2:3] <- colnames(pair)
  out
}

lagged <- df %>%
  group_by(site, season) %>%
    do(rbind(Lag(.[,c("RAD","O3")], 1),
             Lag(.[,c("RAD","O3")], 2),
             Lag(.[,c("RAD","O3")], 3),
             Lag(.[,c("RAD","O3")], 4),
             Lag(.[,c("RAD","O3")], 5),
             Lag(.[,c("RAD","O3")], 6)))
ggp <- ggplot(lagged) +
  geom_point(aes(RAD, O3, group=site, color=site), shape=4)+
    facet_grid(lag~season)
print(ggp)

R includes a function, ccf, for calculating the cross correlation. The value passed to the na.action=na.pass says to compute covariances from the complete cases when there are missing values. Calling this function will generate a corresponding plot by default (which can be turned off). See ?ccf for further details.

ix <- df[["site"]]=="LAU"
ccf(df[ix,"RAD"], df[ix,"O3"], lag.max=6, na.action=na.pass)

This tells us that the correlation is higest at a lag of -3 hours.

We can wrap this in a function that returns information that we want in a data frame (note plot=FALSE argument):

LaggedCorrelation <- function(pair, ...) {
  out <- ccf(pair[,1], pair[,2], ..., na.action=na.pass, plot=FALSE)
  data.frame(lag=out[["lag"]], value=out[["acf"]])
}

lagged.values <- df %>% group_by(site, season) %>%
  do(LaggedCorrelation(.[,c("RAD","O3")], lag.max=6))

Note the syntax, ... in the function definition This passes on any additional arguments (lag.max in this case) from the outer function (LaggedCorrelation) to the inner-most function (ccf). ?... will refer you to the Introduction to R manual for further details on this syntax.

Plot:

ggp <- ggplot(lagged.values)+
  geom_segment(aes(x=lag,xend=lag,y=0,yend=value))+
  facet_grid(site~season)+
  xlab("lag (hours)")+
  ylab("Cross correlation coefficient")
print(ggp)